Summary
This workflow demonstrates the mitosRNAseq pipeline which identifies several types of small RNA from sequencing data from short reads of <50 nt. The following method collects sequence-based counts, sequence locations, genomic features, and annotations. The data we use is from killifish embyros anoxia experiments.
The step-by-step workflow we follow is illistrated below and a detailed method to perform small RNA seq on an academic HPC is decsribed in this document.

The following table shows the contents of the file samples.csv that contains metadata for our samples. We start with a total of 40 fastq files.
Sample Table
|
#
|
Sample
|
Condition
|
Stage
|
Input files
|
|
1
|
31_4d_t0
|
4d_t0
|
4
|
31_4d_t0.fastq.gz
|
|
2
|
32_4d_t0
|
4d_t0
|
4
|
32_4d_t0.fastq.gz
|
|
3
|
33_4d_t0
|
4d_t0
|
4
|
33_4d_t0.fastq.gz
|
|
4
|
34_4d_t0
|
4d_t0
|
4
|
34_4d_t0.fastq.gz
|
|
5
|
35_4d_4hrA
|
4d_4hrA
|
4
|
35_4d_4hrA.fastq.gz
|
|
6
|
36_4d_4hrA
|
4d_4hrA
|
4
|
36_4d_4hrA.fastq.gz
|
|
7
|
37_4d_4hrA
|
4d_4hrA
|
4
|
37_4d_4hrA.fastq.gz
|
|
8
|
38_4d_4hrA
|
4d_4hrA
|
4
|
38_4d_4hrA.fastq.gz
|
|
9
|
39_4d_24hrA
|
4d_24hrA
|
4
|
39_4d_24hrA.fastq.gz
|
|
10
|
40_4d_24hrA
|
4d_24hrA
|
4
|
40_4d_24hrA.fastq.gz
|
|
11
|
41_4d_24hrA
|
4d_24hrA
|
4
|
41_4d_24hrA.fastq.gz
|
|
12
|
42_4d_24hrA
|
4d_24hrA
|
4
|
42_4d_24hrA.fastq.gz
|
|
13
|
43_4d_2hrR
|
4d_2hrR
|
4
|
43_4d_2hrR.fastq.gz
|
|
14
|
44_4d_2hrR
|
4d_2hrR
|
4
|
44_4d_2hrR.fastq.gz
|
|
15
|
45_4d_2hrR
|
4d_2hrR
|
4
|
45_4d_2hrR.fastq.gz
|
|
16
|
46_4d_2hrR
|
4d_2hrR
|
4
|
46_4d_2hrR.fastq.gz
|
|
17
|
47_4d_24hrR
|
4d_24hrR
|
4
|
47_4d_24hrR.fastq.gz
|
|
18
|
48_4d_24hrR
|
4d_24hrR
|
4
|
48_4d_24hrR.fastq.gz
|
|
19
|
49_4d_24hrR
|
4d_24hrR
|
4
|
49_4d_24hrR.fastq.gz
|
|
20
|
50_4d_24hrR
|
4d_24hrR
|
4
|
50_4d_24hrR.fastq.gz
|
|
21
|
1_12d_t0
|
12d_t0
|
12
|
1_12d_t0.fastq.gz
|
|
22
|
2_12d_t0
|
12d_t0
|
12
|
2_12d_t0.fastq.gz
|
|
23
|
5_12d_t0
|
12d_t0
|
12
|
5_12d_t0.fastq.gz
|
|
24
|
6_12d_t0
|
12d_t0
|
12
|
6_12d_t0.fastq.gz
|
|
25
|
10_12d_4hrA
|
12d_4hrA
|
12
|
10_12d_4hrA.fastq.gz
|
|
26
|
11_12d_4hrA
|
12d_4hrA
|
12
|
11_12d_4hrA.fastq.gz
|
|
27
|
12_12d_4hrA
|
12d_4hrA
|
12
|
12_12d_4hrA.fastq.gz
|
|
28
|
7_12d_4hrA
|
12d_4hrA
|
12
|
7_12d_4hrA.fastq.gz
|
|
29
|
13_12d_24hrA
|
12d_24hrA
|
12
|
13_12d_24hrA.fastq.gz
|
|
30
|
16_12d_24hrA
|
12d_24hrA
|
12
|
16_12d_24hrA.fastq.gz
|
|
31
|
17_12d_24hrA
|
12d_24hrA
|
12
|
17_12d_24hrA.fastq.gz
|
|
32
|
18_12d_24hrA
|
12d_24hrA
|
12
|
18_12d_24hrA.fastq.gz
|
|
33
|
19_12d_2hrR
|
12d_2hrR
|
12
|
19_12d_2hrR.fastq.gz
|
|
34
|
21_12d_2hrR
|
12d_2hrR
|
12
|
21_12d_2hrR.fastq.gz
|
|
35
|
22_12d_2hrR
|
12d_2hrR
|
12
|
22_12d_2hrR.fastq.gz
|
|
36
|
23_12d_2hrR
|
12d_2hrR
|
12
|
23_12d_2hrR.fastq.gz
|
|
37
|
25_12d_24hrR
|
12d_24hrR
|
12
|
25_12d_24hrR.fastq.gz
|
|
38
|
28_12d_24hrR
|
12d_24hrR
|
12
|
28_12d_24hrR.fastq.gz
|
|
39
|
29_12d_24hrR
|
12d_24hrR
|
12
|
29_12d_24hrR.fastq.gz
|
|
40
|
30_12d_24hrR
|
12d_24hrR
|
12
|
30_12d_24hrR.fastq.gz
|
QC for raw reads
Quality Control (QC) is done by running fastQC on raw reads to evaluate how good (or bad) our data looks.
Install software for Quality Check
conda
Set up a virtual environment in conda that has all of the Quality Control (QC) tools. We are using Miniconda. For creating sharable environments, create .yaml files for each environment you’re making.
conda create -n condafastqc fastqc
conda activate condafastqc
conda install -n multiqc
tmux
For almost every long-ish process, we use tmux to run tasks in a detached terminal window that can keep running if we want to log out or if we accidently log out.
tmux new -s fastqc
Executing
Creating a new file called fastqcr.R that has the script to run fastqc through R. We use this script to automate the process for a list of files in a directory. Running commands through various scripts also keeps a record of our commands.
if (!require("fastqcr")) {
install.packages("fastqcr", repos="http://cran.us.r-project.org")
library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/init_fastq_files",
qc.dir="/path/to/directory/fastqc/init_fastqc",
threads=20)
Running the script with the following command:
Rscript fastqcr.R
Press Ctrl+B and then D for detaching this tmux window. You can log out from your remote computer (HPC) and close your terminal window now while this process keeps running on your remote computer. To get back to this window, log in to your remote computer, and use the command tmux a -t fastqc. F or more information, check tmux.
We will get a fastqc HTML file in the output fastqc directory for each fastq sequence in the input init_fastq_files directory.
FastQC Output
General Statistics from multiqc
| Sample |
PercentDups |
PercentGC |
BPMedianReadLength |
MillionsSeqs |
| 10_12d_4hrA |
94.1 |
51 |
36 |
12.6 |
| 11_12d_4hrA |
92.7 |
51 |
36 |
16.2 |
| 12_12d_4hrA |
87.4 |
52 |
36 |
9.0 |
| 13_12d_24hrA |
90.5 |
53 |
36 |
8.5 |
| 16_12d_24hrA |
91.1 |
50 |
36 |
7.9 |
| 17_12d_24hrA |
92.4 |
50 |
36 |
29.9 |
| 18_12d_24hrA |
84.6 |
52 |
36 |
11.8 |
| 19_12d_2hrR |
87.6 |
52 |
36 |
11.6 |
| 1_12d_t0 |
86.4 |
51 |
36 |
8.7 |
| 21_12d_2hrR |
95.4 |
54 |
36 |
9.8 |
| 22_12d_2hrR |
80.5 |
52 |
36 |
6.7 |
| 23_12d_2hrR |
89.3 |
52 |
36 |
7.8 |
| 25_12d_24hrR |
88.1 |
53 |
36 |
13.5 |
| 28_12d_24hrR |
91.1 |
55 |
36 |
5.9 |
| 29_12d_24hrR |
91.4 |
54 |
36 |
8.3 |
| 2_12d_t0 |
91.5 |
54 |
36 |
8.3 |
| 30_12d_24hrR |
79.9 |
53 |
36 |
12.8 |
| 31_4d_t0 |
79.3 |
52 |
50 |
11.9 |
| 32_4d_t0 |
80.1 |
53 |
50 |
11.6 |
| 33_4d_t0 |
86.3 |
51 |
50 |
13.7 |
| 34_4d_t0 |
82.9 |
52 |
50 |
15.2 |
| 35_4d_4hrA |
88.4 |
52 |
50 |
11.0 |
| 36_4d_4hrA |
87.0 |
51 |
50 |
13.8 |
| 37_4d_4hrA |
84.7 |
52 |
50 |
15.6 |
| 38_4d_4hrA |
84.4 |
51 |
50 |
15.3 |
| 39_4d_24hrA |
79.7 |
53 |
50 |
12.4 |
| 40_4d_24hrA |
74.4 |
51 |
50 |
12.9 |
| 41_4d_24hrA |
79.3 |
52 |
50 |
9.7 |
| 42_4d_24hrA |
85.6 |
53 |
50 |
16.7 |
| 43_4d_2hrR |
88.3 |
52 |
50 |
13.0 |
| 44_4d_2hrR |
81.5 |
44 |
101 |
11.0 |
| 45_4d_2hrR |
87.9 |
44 |
101 |
13.8 |
| 46_4d_2hrR |
88.9 |
52 |
50 |
16.9 |
| 47_4d_24hrR |
72.8 |
45 |
101 |
14.6 |
| 48_4d_24hrR |
83.2 |
53 |
50 |
13.4 |
| 49_4d_24hrR |
87.1 |
51 |
50 |
22.8 |
| 50_4d_24hrR |
88.9 |
41 |
101 |
14.8 |
| 5_12d_t0 |
81.3 |
53 |
36 |
11.2 |
| 6_12d_t0 |
81.7 |
53 |
36 |
13.6 |
| 7_12d_4hrA |
81.4 |
52 |
36 |
10.5 |
Sequence Counts for each sample
The mean quality scores per base in the read
The average quality scores per sequence
The GC content per sequence
Sequence Duplication
Overrepresented sequences
Adapter Content
Overall Pass/Fail
Sequence length distribution
All samples had sequences of a single length (50bp , 36bp , 101bp).
Trimming
Trimming is done with Trimmomatic in our case, however, other tools like cutadapt or fastp may also be used. This tool runs through the following bash script
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/init_fastq_files/*.fastq.gz"
for f in $FILES
do
b=`basename $f`
echo Running trimming for the file $b
c=${b::-9}
o="$c.trim.fastq"
log="$c.out.log"
c="$c.log.txt"
java -jar /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads 20 -phred33 -trimlog $c $f $o ILLUMINACLIP:/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/adapter_contamination_sequences_AR3.txt:2:30:5:1:true SLIDINGWINDOW:5:15 LEADING:20 TRAILING:20 MINLEN:15 2> $log
done
| Sample |
TotalReads |
SurvivingReads |
PercentSurviving |
DroppedReads |
PercentDropped |
| 10_12d_4hrA |
12625447 |
4421375 |
35.0 |
8204072 |
65.0 |
| 11_12d_4hrA |
16238016 |
8893201 |
54.8 |
7344815 |
45.2 |
| 12_12d_4hrA |
9024387 |
7219072 |
80.0 |
1805315 |
20.0 |
| 13_12d_24hrA |
8467459 |
5182199 |
61.2 |
3285260 |
38.8 |
| 16_12d_24hrA |
7908381 |
6458185 |
81.7 |
1450196 |
18.3 |
| 17_12d_24hrA |
29872147 |
17868683 |
59.8 |
12003464 |
40.2 |
| 18_12d_24hrA |
11793810 |
8182929 |
69.4 |
3610881 |
30.6 |
| 19_12d_2hrR |
11606761 |
9574385 |
82.5 |
2032376 |
17.5 |
| 1_12d_t0 |
8698124 |
6498924 |
74.7 |
2199200 |
25.3 |
| 21_12d_2hrR |
9784686 |
3585294 |
36.6 |
6199392 |
63.4 |
| 22_12d_2hrR |
6741253 |
4728213 |
70.1 |
2013040 |
29.9 |
| 23_12d_2hrR |
7836111 |
4455897 |
56.9 |
3380214 |
43.1 |
| 25_12d_24hrR |
13534542 |
8957294 |
66.2 |
4577248 |
33.8 |
| 28_12d_24hrR |
5866632 |
2402889 |
41.0 |
3463743 |
59.0 |
| 29_12d_24hrR |
8305061 |
2923930 |
35.2 |
5381131 |
64.8 |
| 2_12d_t0 |
8344087 |
6946381 |
83.2 |
1397706 |
16.8 |
| 30_12d_24hrR |
12811149 |
8785705 |
68.6 |
4025444 |
31.4 |
| 31_4d_t0 |
11936271 |
10306761 |
86.3 |
1629510 |
13.7 |
| 32_4d_t0 |
11595360 |
7974434 |
68.8 |
3620926 |
31.2 |
| 33_4d_t0 |
13655749 |
10422782 |
76.3 |
3232967 |
23.7 |
| 34_4d_t0 |
15201813 |
11186661 |
73.6 |
4015152 |
26.4 |
| 35_4d_4hrA |
11045591 |
7821501 |
70.8 |
3224090 |
29.2 |
| 36_4d_4hrA |
13840101 |
9383343 |
67.8 |
4456758 |
32.2 |
| 37_4d_4hrA |
15579833 |
12373672 |
79.4 |
3206161 |
20.6 |
| 38_4d_4hrA |
15286687 |
12142640 |
79.4 |
3144047 |
20.6 |
| 39_4d_24hrA |
12442304 |
9755623 |
78.4 |
2686681 |
21.6 |
| 40_4d_24hrA |
12866355 |
11130799 |
86.5 |
1735556 |
13.5 |
| 41_4d_24hrA |
9720710 |
8032689 |
82.6 |
1688021 |
17.4 |
| 42_4d_24hrA |
16698310 |
12979297 |
77.7 |
3719013 |
22.3 |
| 43_4d_2hrR |
12989273 |
9158318 |
70.5 |
3830955 |
29.5 |
| 44_4d_2hrR |
10968948 |
8421811 |
76.8 |
2547137 |
23.2 |
| 45_4d_2hrR |
13812376 |
6527219 |
47.3 |
7285157 |
52.7 |
| 46_4d_2hrR |
16946884 |
12011097 |
70.9 |
4935787 |
29.1 |
| 47_4d_24hrR |
14627707 |
12933658 |
88.4 |
1694049 |
11.6 |
| 48_4d_24hrR |
13383720 |
10269804 |
76.7 |
3113916 |
23.3 |
| 49_4d_24hrR |
22803428 |
18172334 |
79.7 |
4631094 |
20.3 |
| 50_4d_24hrR |
14759393 |
7495120 |
50.8 |
7264273 |
49.2 |
| 5_12d_t0 |
11217968 |
7594023 |
67.7 |
3623945 |
32.3 |
| 6_12d_t0 |
13607922 |
9592402 |
70.5 |
4015520 |
29.5 |
| 7_12d_4hrA |
10489257 |
6515199 |
62.1 |
3974058 |
37.9 |
QC for trimmed reads
Quality Control (QC) is done for the trimmed reads with fastQC. We activate the conda environment condafastqc for this process.
if (!require("fastqcr")) {
install.packages("fastqcr", repos="http://cran.us.r-project.org")
library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/trimmed_fastq_files",
qc.dir="/path/to/directory/fastqc/trim_fastqc",
threads=20)
Running the script with the following command:
Rscript fastqcr.R
FastQC Output
Plots for each FASTQC metric after trimming is shown below. All the adapters were removed and overall quality looks better.
General Statistics from multiqc
| Sample |
PercentDups |
PercentGC |
BPMedianReadLength |
MillionsSeqs |
| 10_12d_4hrA |
89.4 |
50 |
22 |
4.4 |
| 11_12d_4hrA |
90.3 |
49 |
22 |
8.9 |
| 12_12d_4hrA |
88.7 |
51 |
22 |
7.2 |
| 13_12d_24hrA |
88.4 |
50 |
22 |
5.2 |
| 16_12d_24hrA |
91.9 |
48 |
22 |
6.5 |
| 17_12d_24hrA |
91.2 |
48 |
22 |
17.9 |
| 18_12d_24hrA |
84.9 |
50 |
22 |
8.2 |
| 19_12d_2hrR |
88.5 |
51 |
22 |
9.6 |
| 1_12d_t0 |
86.7 |
49 |
22 |
6.5 |
| 21_12d_2hrR |
90.5 |
47 |
22 |
3.6 |
| 22_12d_2hrR |
80.9 |
50 |
22 |
4.7 |
| 23_12d_2hrR |
86.9 |
49 |
22 |
4.5 |
| 25_12d_24hrR |
86.6 |
50 |
22 |
9.0 |
| 28_12d_24hrR |
84.3 |
50 |
22 |
2.4 |
| 29_12d_24hrR |
83.9 |
50 |
22 |
2.9 |
| 2_12d_t0 |
93.8 |
53 |
22 |
6.9 |
| 30_12d_24hrR |
78.7 |
51 |
22 |
8.8 |
| 31_4d_t0 |
80.8 |
49 |
22 |
10.3 |
| 32_4d_t0 |
77.1 |
50 |
22 |
8.0 |
| 33_4d_t0 |
86.5 |
48 |
22 |
10.4 |
| 34_4d_t0 |
81.7 |
51 |
22 |
11.2 |
| 35_4d_4hrA |
86.5 |
50 |
22 |
7.8 |
| 36_4d_4hrA |
82.7 |
49 |
22 |
9.4 |
| 37_4d_4hrA |
84.8 |
51 |
22 |
12.4 |
| 38_4d_4hrA |
84.2 |
48 |
22 |
12.1 |
| 39_4d_24hrA |
79.1 |
51 |
23 |
9.8 |
| 40_4d_24hrA |
73.6 |
50 |
27 |
11.1 |
| 41_4d_24hrA |
79.0 |
49 |
22 |
8.0 |
| 42_4d_24hrA |
85.3 |
50 |
22 |
13.0 |
| 43_4d_2hrR |
86.9 |
50 |
22 |
9.2 |
| 44_4d_2hrR |
82.1 |
50 |
20 |
8.4 |
| 45_4d_2hrR |
80.0 |
50 |
20 |
6.5 |
| 46_4d_2hrR |
88.4 |
48 |
22 |
12.0 |
| 47_4d_24hrR |
73.3 |
48 |
26 |
12.9 |
| 48_4d_24hrR |
82.6 |
52 |
22 |
10.3 |
| 49_4d_24hrR |
87.9 |
49 |
22 |
18.2 |
| 50_4d_24hrR |
84.4 |
49 |
20 |
7.5 |
| 5_12d_t0 |
81.2 |
50 |
22 |
7.6 |
| 6_12d_t0 |
82.7 |
51 |
22 |
9.6 |
| 7_12d_4hrA |
79.5 |
50 |
22 |
6.5 |
Sequence Counts for each sample
The mean quality scores per base in the read
The average quality scores per sequence
The GC content per sequence
Sequence Duplication
Overrepresented sequences
Overall Pass/Fail
Sequence length distribution
Alignment
First things first- create a conda environment that has bowtie, samtools, picard, and bedtools.
conda create -n condabowtie bowtie samtools picard bedtools
The trimmed reads are aligned to the reference genome using bowtie. To analyze reads mapping with mitochondrial genome, we perform the alignment in two steps.
- Complete Genome (with mitochondrial genome)- with algenome.fa
- Only the Mitochondrial genome which is extracted from genome.fa- almitogenome.fa
To begin alignment, we make bowtie indexes, this is done in two sets.
bowtie-index algenome.fa
bowtie-index almitogenome.fa
Alignment with genome
The alignment was done on the trimmed reads from previous section.
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
do
b=`basename $f`
echo Running trimming for the file $b
t=${b::-6}
o="$t.sam"
uo="$t.mapped.fq"
echo Output file is $o
bowtie \
-p 20 \
-t \
-k 5 \
--best \
--strata \
-e 99999 \
-v 0 \
-l 15 \
--chunkmbs 2048 \
-x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_index_alim/algenome \
-q $f \
--al $uo \
--sam --no-unal > $o 2> $t.bowtie.log
done
#done
The following stats show alignment QC.
This table shows the number of reads aligned in each sample (MillionAlignedReads). It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).
|
Sample
|
PercentAligned
|
MillionAlignedReads
|
MillionMappedReads
|
|
10_12d_4hrA
|
73.2
|
3.2
|
8.0
|
|
11_12d_4hrA
|
73.6
|
6.5
|
15.4
|
|
12_12d_4hrA
|
70.4
|
5.1
|
12.4
|
|
13_12d_24hrA
|
69.6
|
3.6
|
8.7
|
|
16_12d_24hrA
|
77.5
|
5.0
|
10.3
|
|
17_12d_24hrA
|
75.8
|
13.5
|
31.4
|
|
18_12d_24hrA
|
73.7
|
6.0
|
15.0
|
|
19_12d_2hrR
|
71.9
|
6.9
|
16.8
|
|
1_12d_t0
|
73.3
|
4.8
|
11.1
|
|
21_12d_2hrR
|
77.8
|
2.8
|
5.3
|
|
22_12d_2hrR
|
66.2
|
3.1
|
7.4
|
|
23_12d_2hrR
|
73.2
|
3.3
|
7.8
|
|
25_12d_24hrR
|
69.7
|
6.2
|
15.0
|
|
28_12d_24hrR
|
65.3
|
1.6
|
3.7
|
|
29_12d_24hrR
|
70.9
|
2.1
|
5.0
|
|
2_12d_t0
|
76.9
|
5.3
|
13.1
|
|
30_12d_24hrR
|
65.5
|
5.8
|
13.8
|
|
31_4d_t0
|
62.0
|
6.4
|
11.8
|
|
32_4d_t0
|
73.5
|
5.9
|
12.2
|
|
33_4d_t0
|
80.6
|
8.4
|
15.0
|
|
34_4d_t0
|
70.5
|
7.9
|
15.8
|
|
35_4d_4hrA
|
71.5
|
5.6
|
9.2
|
|
36_4d_4hrA
|
62.1
|
5.8
|
8.3
|
|
37_4d_4hrA
|
69.1
|
8.6
|
16.4
|
|
38_4d_4hrA
|
76.1
|
9.2
|
15.4
|
|
39_4d_24hrA
|
79.0
|
7.7
|
17.2
|
|
40_4d_24hrA
|
52.3
|
5.8
|
10.9
|
|
41_4d_24hrA
|
63.0
|
5.1
|
8.8
|
|
42_4d_24hrA
|
66.6
|
8.6
|
15.2
|
|
43_4d_2hrR
|
71.2
|
6.5
|
11.7
|
|
44_4d_2hrR
|
64.6
|
5.4
|
11.8
|
|
45_4d_2hrR
|
65.7
|
4.3
|
9.5
|
|
46_4d_2hrR
|
71.2
|
8.6
|
13.8
|
|
47_4d_24hrR
|
80.1
|
10.4
|
24.4
|
|
48_4d_24hrR
|
51.8
|
5.3
|
10.7
|
|
49_4d_24hrR
|
72.3
|
13.1
|
23.2
|
|
50_4d_24hrR
|
79.4
|
6.0
|
13.4
|
|
5_12d_t0
|
66.2
|
5.0
|
11.9
|
|
6_12d_t0
|
67.5
|
6.5
|
15.5
|
|
7_12d_4hrA
|
65.6
|
4.3
|
10.1
|
Alignment with mitochondria
The alignment was done on the trimmed reads from previous section.
The following stats show alignment QC.
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
do
b=`basename $f`
echo Running alignment for the file $b
t=${b::-6}
o="$t.sam"
uo="$t.mapped.fq"
echo Output file is $o
bowtie \
-p 20 \
-t \
-k 10 \
--best \
--strata \
-e 99999 \
-v 0 \
-l 15 \
--chunkmbs 2048 \
-x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-q $f \
--al $uo \
--sam --no-unal > $o 2> $t.bowtie.log
done
#done
This table shows the number of reads aligned in each sample (MillionAlignedReads). It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).
|
Sample
|
PercentAligned
|
MillionAlignedReads
|
MillionMappedReads
|
|
10_12d_4hrA
|
0.34
|
0.01
|
0.02
|
|
11_12d_4hrA
|
0.47
|
0.04
|
0.04
|
|
12_12d_4hrA
|
0.41
|
0.03
|
0.03
|
|
13_12d_24hrA
|
0.50
|
0.03
|
0.03
|
|
16_12d_24hrA
|
0.52
|
0.03
|
0.03
|
|
17_12d_24hrA
|
3.25
|
0.58
|
0.59
|
|
18_12d_24hrA
|
6.22
|
0.51
|
0.52
|
|
19_12d_2hrR
|
0.92
|
0.09
|
0.09
|
|
1_12d_t0
|
1.25
|
0.08
|
0.08
|
|
21_12d_2hrR
|
0.79
|
0.03
|
0.03
|
|
22_12d_2hrR
|
4.43
|
0.21
|
0.21
|
|
23_12d_2hrR
|
2.68
|
0.12
|
0.12
|
|
25_12d_24hrR
|
4.72
|
0.42
|
0.43
|
|
28_12d_24hrR
|
0.32
|
0.01
|
0.01
|
|
29_12d_24hrR
|
4.08
|
0.12
|
0.12
|
|
2_12d_t0
|
0.25
|
0.02
|
0.02
|
|
30_12d_24hrR
|
8.18
|
0.72
|
0.73
|
|
31_4d_t0
|
4.99
|
0.51
|
0.52
|
|
32_4d_t0
|
10.12
|
0.81
|
0.82
|
|
33_4d_t0
|
1.99
|
0.21
|
0.21
|
|
34_4d_t0
|
6.72
|
0.75
|
0.77
|
|
35_4d_4hrA
|
0.43
|
0.03
|
0.03
|
|
36_4d_4hrA
|
0.59
|
0.06
|
0.06
|
|
37_4d_4hrA
|
1.10
|
0.14
|
0.14
|
|
38_4d_4hrA
|
5.67
|
0.69
|
0.70
|
|
39_4d_24hrA
|
8.32
|
0.81
|
0.83
|
|
40_4d_24hrA
|
7.77
|
0.87
|
0.88
|
|
41_4d_24hrA
|
7.50
|
0.60
|
0.61
|
|
42_4d_24hrA
|
2.13
|
0.28
|
0.28
|
|
43_4d_2hrR
|
0.61
|
0.06
|
0.06
|
|
44_4d_2hrR
|
1.00
|
0.08
|
0.09
|
|
45_4d_2hrR
|
4.69
|
0.31
|
0.31
|
|
46_4d_2hrR
|
1.07
|
0.13
|
0.13
|
|
47_4d_24hrR
|
16.68
|
2.16
|
2.19
|
|
48_4d_24hrR
|
0.18
|
0.02
|
0.02
|
|
49_4d_24hrR
|
1.49
|
0.27
|
0.28
|
|
50_4d_24hrR
|
2.22
|
0.17
|
0.17
|
|
5_12d_t0
|
6.11
|
0.46
|
0.47
|
|
6_12d_t0
|
6.47
|
0.62
|
0.63
|
|
7_12d_4hrA
|
5.39
|
0.35
|
0.36
|
Another QC that we could fo with the aligned read is with Picard. The following is the command line to do that:
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*sorted.bam"
for f in $FILES
do
b=`basename $f`
echo Converting file $b
o="$b.picard.txt"
echo Output file is $o
picard -Xmx5g CollectAlignmentSummaryMetrics I=$f O=$o R=/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/GCF_001266775.1_Austrofundulus_limnaeus-1.0_genomic_andMITO.fna
done
#done
Before we proceed, we need to sort the sam, convert it into bam and create a bam index (bai). This step ensures that we could store the alignments in acceptable binary format that takes less storage space and is compatible with subsequent steps.
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.sam"
for f in $FILES
do
b=`basename $f`
echo Converting $b to bam
o=${b::-4}
bam="$o.bam"
sortedbam="$b.sorted.bam"
echo Tmp file is $bam
samtools view -b $f > $bam
echo Output files are $sortedbam and its index
samtools sort $bam -o $sortedbam
samtools index $sortedbam
echo Deleting Tmp file $bam
rm $bam
done
#done
Using aligned reads
We will create three files:
- Summary information that has read ID, sequence, reference location, and number of alignments.
- Exact location, genomic features of the reference location that the reads mapped to via bedtools.
- Counting the readcounts from bam files.
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
do
b=`basename $f`
echo $b
o=${b::-4}
o="$o.summary.txt"
input_file="$f"
output_file="$o.summary.txt"
# Process the input file with samtools and awk
samtools view $f | awk '{print $1, $10, $3, $4, $15}' | sort | uniq > "$output_file"
echo "Output file written $output_file"
done
The result is the following data for each file. The sample 1_12d_t0 is shown. 
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
do
b=`basename $f`
echo $b
o=${b::-4}
bed="$o.bed"
out="$o.bedtools.txt"
gff="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/mito_genome/GCF_001266775.1_Austrofundulus_limnaeus_MITOonly.gff"
bedtools intersect -a $f -b $gff -wo -f 1 -bed > $bed
awk '!seen[$4]++' $bed > tmp.csv
rm $bed
awk '{print $4, $21}' tmp.csv > $out
rm tmp.csv
echo "Output file written $out"
done
The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown. 
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
do
b=`basename $f`
echo $b
o=${b::-4}
o="$o.readcount.txt"
echo Output file is $o
# Count unique occurrences of column 10
samtools view $f | awk '{print $10}'| sort | uniq -c > "$o"
echo "Output written to $o"
done
The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown. 
miRTrace
Running mirtrace with the following command.
mirtrace qc --species dre --config seq_address.txt
Looking at the results:

sports
Executing sports tool on mapped reads only. We use the following command to run:
#!/bin/bash
sports.pl -i seqs_alim_mito.txt \
-o /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/out_sports/REDO/mito_mapped_reads \
-k \
-M 2 \
-p 20 \
-g /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-m /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/miRBase_21/miRBase_21-dre \
-r /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/rRNA_db/zebrafish_rRNA \
-t /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/GtRNAdb/danRer6-tRNAs \
-w /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/piRBase/piR_dre_v1.0 \
-e /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Ensembl/Danio_rerio.GRCz10.ncrna \
-f /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Rfam_12.3/Rfam-12.3-zebrafish
This will generate a number of output files, from which the tables generated as a result file *_output.txt for each sample are taken and are used to extract annotation of mapped reads. Top 50 rows of this table are shown for the sample 10_12d_4hrA below.

It also generates a number of plots to categorize the annotations and their length distributions for each sample.

Compiling output files
At the end, we need to compile all the output files into a single dataframe that can be exported to further do downstream processing such as statistics, differential expression, and clustering.
We provide an R script that compiles these output files:
- Annotation: From sports- *.mapped_output.txt
- Alignment summary: From .bam files with samtools- *.summary.txt
- Readcount: From .bam files with samtools and awk- *.readcount.txt
- Location: From .bam files with bedtools- *.bedtools.txt
In addition, we also need a sample design file (.csv format) that has details about the samples. This file makes it easier to do statistical analysis on the data using R packages like DESeq2, ExpressionSet, and EdgeR.
You would alter the following lines in the R script that basically tells the program where these txt files are located.
parent_dir <- "D:/PSU_work/smrnaseq/mito_map"
samples <- "D:/PSU_work/smrnaseq/mito_map/design_samples.csv"
readcount_txt_pattern <- "sorted.readcount.txt$"
output_txt_pattern <- "mapped_output.txt$"
summary_txt_pattern <- "summary.txt$"
bedtools_txt_pattern <- "bedtools.txt$"
Set pattern of filenames: these files are searched in the parent_dir recursively. Make sure that the parent_dir contains these files and the pattern should be unique to list the files that we need.
Merge results from our samples can be viewed from our github repo
Visualizing
We use the alignments with mitochondrial genome to view the aligned reads.
Jbrowse is used to explore alignments.
Files used:
- Aligned bam files for each sample and its corresponding index file (bam, bai)
- Reference genome for mitochondria (.fasta) and its index file (.fai)
- Genome feature file (.gff) for mitochondrial genome
Click to enlarge
This figure shows mitochondiral genome of 100bp length from 17100 to 17199. On this genome, we see 1) the colored basepairs, 2) Genomic features of the mentioned length, 3) Alignments from 4 samples (4d t0 replicates) that show reads aligned to the genome colored by strand while also showing the coverage histogram in grey.
---
title: "mitosRNAseq Workflow"
date: 08-15-2024
output: 
    html_notebook:
        toc: TRUE
        toc_float: true
---


##  Summary
This workflow demonstrates the mitosRNAseq pipeline which identifies several types of
small RNA from sequencing data from short reads of <50 nt. The following method collects sequence-based 
counts, sequence locations, genomic features, and annotations. 
The data we use is from killifish embyros anoxia experiments. 

The step-by-step workflow we follow is illistrated below and a 
detailed method to perform small RNA seq on an academic HPC is decsribed in this document. 

![](/home/gazal/Documents/PSU_work/sRNA_gazal/documentation/Figure 1.png)

The following table shows the contents of the file **samples.csv** that contains metadata for our samples.
We start with a total of 40 fastq files.

### Sample Table
```{r echo=FALSE, results='asis', warning=FALSE}
library(knitr)
library(readr)
library(formattable)
library(readr)
df_samples <- read_csv("samples.csv", show_col_types = FALSE)
formattable(df_samples,
        table.attr = 'style = "width:50%; font-size:90%;"'
        )
```

## QC for raw reads

Quality Control (QC) is done by running fastQC on raw reads to evaluate how good (or bad) our data looks.

### Install software for Quality Check

**conda**

Set up a virtual environment in conda that has all of the Quality Control (QC) tools. 
We are using [Miniconda](https://docs.anaconda.com/miniconda/). 
For creating sharable environments, create .yaml files for each environment you're making.

```
conda create -n condafastqc fastqc
conda activate condafastqc
conda install -n multiqc
```

**tmux**

For almost every long-ish process, 
we use tmux to run tasks in a detached terminal window that can keep running if we want to log out or if we accidently log out.
```
tmux new -s fastqc
```

### Executing

Creating a new file called **fastqcr.R** that has the script to run fastqc through R.
We use this script to automate the process for a list of files in a directory. 
Running commands through various scripts also keeps a record of our commands.

```
if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/init_fastq_files",
      qc.dir="/path/to/directory/fastqc/init_fastqc",
      threads=20)
```

Running the script with the following command:
```
Rscript fastqcr.R
```
Press `Ctrl+B` and then `D` for detaching this tmux window. 
You can log out from your remote computer (HPC) and close your terminal 
window now while this process keeps running on your remote computer. 
To get back to this window, log in to your remote computer, and
 use the command `tmux a -t fastqc`. F
or more information, check [tmux](https://github.com/tmux/tmux/wiki/Getting-Started). 

We will get a fastqc HTML file in the output fastqc directory for 
each fastq sequence in the input init_fastq_files directory.

### FastQC Output

**General Statistics from multiqc**
```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("multiqc_init_fastq/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'PercentDups' = color_bar("#E9967A"),
    'PercentGC' = color_bar("#DAA520"),
    'BPMedianReadLength' = color_bar("#66CDAA"),
    'MillionsSeqs' = color_bar("#DA70D6")),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'"
)
```

**Sequence Counts for each sample**

![Sequence Counts](multiqc_init_fastq/multiqc_plots/fastqc_sequence_counts_plot.png){width=550px}

**The mean quality scores per base in the read**

![Per Base Quality](multiqc_init_fastq/multiqc_plots/fastqc_per_base_sequence_quality_plot.png){width=550px}

**The average quality scores per sequence**

![Per Sequence Quality](multiqc_init_fastq/multiqc_plots/fastqc_per_sequence_quality_scores_plot.png){width=550px}

**The GC content per sequence**

![GC conetent](multiqc_init_fastq/multiqc_plots/fastqc_per_sequence_gc_content_plot.png){width=550px}


**Sequence Duplication**

![Duplication in each sequence](multiqc_init_fastq/multiqc_plots/fastqc_sequence_duplication_levels_plot.png){width=550px}

**Overrepresented sequences**

![Total overrepresented sequences found in each sample](multiqc_init_fastq/multiqc_plots/fastqc_overrepresented_sequences_plot.png){width=550px}

**Adapter Content**

![Adapters found at sequence positions](multiqc_init_fastq/multiqc_plots/fastqc_adapter_content_plot.png){width=550px}

**Overall Pass/Fail**

![Checking if each sample passes or fails in various meteris](multiqc_init_fastq/multiqc_plots/fastqc-status-check-heatmap.png){width=550px}

**Sequence length distribution**

All samples had sequences of a single length (50bp , 36bp , 101bp).

## Trimming

Trimming is done with Trimmomatic in our case, however, other tools like cutadapt or fastp may also be used.
This tool runs through the following bash script

```
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/init_fastq_files/*.fastq.gz"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b

        c=${b::-9}
        o="$c.trim.fastq"
	log="$c.out.log"
        c="$c.log.txt"
        java -jar /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads 20 -phred33 -trimlog $c $f $o ILLUMINACLIP:/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/adapter_contamination_sequences_AR3.txt:2:30:5:1:true SLIDINGWINDOW:5:15 LEADING:20 TRAILING:20 MINLEN:15 2> $log
    done
```

![Reads Surviving](trimmomatic/trimmomatic_plot.png){width=550px}

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("trimmomatic/trimmomatic_plot.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'TotalReads' = color_bar("#BDB76B"),
    'SurvivingReads' = color_bar("#90EE90"),
    'PercentSurviving' = color_bar("#32CD32"),
    'DroppedReads' = color_bar("#F08080"),
    'PercentDropped' = color_bar("#CD5C5C")),
    table.attr = "class=\'table table-condensed\', style=\'width:100%; font-size:90%;\'"
)
```
				

## QC for trimmed reads

Quality Control (QC) is done for the trimmed reads with fastQC. 
We activate the conda environment `condafastqc` for this process.

```
if (!require("fastqcr")) {
        install.packages("fastqcr", repos="http://cran.us.r-project.org")
        library(fastqcr)
} else {library(fastqcr)}
fastqc(fq.dir="/path/to/directory/trimmed_fastq_files",
      qc.dir="/path/to/directory/fastqc/trim_fastqc",
      threads=20)
```

Running the script with the following command:
```
Rscript fastqcr.R
```

### FastQC Output

Plots for each FASTQC metric after trimming is shown below. 
All the adapters were removed and overall quality looks better.

**General Statistics from multiqc**
```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("multiqc_trim_fastq/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(
    'PercentDups' = color_bar("#E9967A"),
    'PercentGC' = color_bar("#DAA520"),
    'BPMedianReadLength' = color_bar("#66CDAA"),
    'MillionsSeqs' = color_bar("#DA70D6")),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'"
)
```
**Sequence Counts for each sample**

![Sequence Counts](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_counts_plot.png){width=550px}

**The mean quality scores per base in the read**

![Per Base Quality](multiqc_trim_fastq/multiqc_plots/fastqc_per_base_sequence_quality_plot.png){width=550px}

**The average quality scores per sequence**

![Per Sequence Quality](multiqc_trim_fastq/multiqc_plots/fastqc_per_sequence_quality_scores_plot.png){width=550px}

**The GC content per sequence**

![GC conetent](multiqc_trim_fastq/multiqc_plots/fastqc_per_sequence_gc_content_plot.png){width=550px}


**Sequence Duplication**

![Duplication in each sequence](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_duplication_levels_plot.png){width=550px}

**Overrepresented sequences**

![Total overrepresented sequences found in each sample](multiqc_trim_fastq/multiqc_plots/fastqc_overrepresented_sequences_plot.png){width=550px}

**Overall Pass/Fail**

![Checking if each sample passes or fails in various meteris](multiqc_trim_fastq/multiqc_plots/fastqc-status-check-heatmap.png){width=550px}

**Sequence length distribution**

![length distribution of sequences](multiqc_trim_fastq/multiqc_plots/fastqc_sequence_length_distribution_plot.png){width=550px}

## Alignment

First things first- create a conda environment that has bowtie, samtools, picard, and bedtools.

`conda create -n condabowtie bowtie samtools picard bedtools`

The trimmed reads are aligned to the reference genome using bowtie.
To analyze reads mapping with mitochondrial genome, we perform the alignment in two steps.

1. Complete Genome (with mitochondrial genome)- with algenome.fa
2. Only the Mitochondrial genome which is extracted from genome.fa- almitogenome.fa

To begin alignment, we make bowtie indexes, this is done in two sets.

```
bowtie-index algenome.fa
bowtie-index almitogenome.fa
```

### Alignment with genome

The alignment was done on the trimmed reads from previous section.

```
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"
for f in $FILES
    do
        b=`basename $f`
        echo Running trimming for the file $b
        t=${b::-6}
        o="$t.sam"
        uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
	 -p 20 \
     	 -t \
	 -k 5 \
	 --best \
	 --strata \
	 -e 99999 \
	 -v 0 \
   	 -l 15 \
  	 --chunkmbs 2048 \
	 -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_index_alim/algenome \
	 -q $f \
	 --al $uo \
	 --sam --no-unal > $o 2> $t.bowtie.log
    done
#done
```

The following stats show alignment QC. 

![Bowtie Alignment Stats with Genome](alignment/genome/bowtie1_alignment.png){width=550px}

This table shows the number of reads aligned in each sample (MillionAlignedReads). 
It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("alignment/genome/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 1)
formattable(df_stats, list(	
    'PercentAligned' = color_bar("#B0C4DE"),
    'MillionAlignedReads' = color_bar("#E6E6FA"),
    'MillionMappedReads' = color_bar("#778899"),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'")
)
```
### Alignment with mitochondria

The alignment was done on the trimmed reads from previous section.

The following stats show alignment QC. 

```
#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/trim/*.fastq"	
for f in $FILES
    do
        b=`basename $f`
        echo Running alignment for the file $b
        t=${b::-6}
        o="$t.sam"
	uo="$t.mapped.fq"
        echo Output file is $o
        bowtie \
	 -p 20 \
     	 -t \
	 -k 10 \
	 --best \
	 --strata \
	 -e 99999 \
	 -v 0 \
   	 -l 15 \
  	 --chunkmbs 2048 \
	 -x /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
	 -q $f \
	 --al $uo \
	 --sam --no-unal > $o 2> $t.bowtie.log
    done
#done
```

![Bowtie Alignment Stats with Mitochondrial Genome](alignment/mitochondria/bowtie1_alignment.png){width=550px}

This table shows the number of reads aligned in each sample (MillionAlignedReads). 
It also shows the number of aligments of various reads that includes multimapped reads (MillionMappedReads).

```{r echo=FALSE, results='asis', warning=FALSE}
library(readr)
library(formattable)
df_stats <- read_tsv("alignment/mitochondria/general_stats_table.tsv", show_col_types = FALSE)
is.num <- sapply(df_stats, is.numeric)
df_stats[is.num] <- lapply(df_stats[is.num], round, 2)
formattable(df_stats, list(	
    'PercentAligned' = color_bar("#B0C4DE"),
    'MillionAlignedReads' = color_bar("#E6E6FA"),
    'MillionMappedReads' = color_bar("#778899"),
    table.attr = "class=\'table table-condensed\', style=\'width:80%; font-size:90%;\'")
)
```
Another QC that we could fo with the aligned read is with Picard. The following is the command line to do that:

```
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*sorted.bam"
for f in $FILES
    do
        b=`basename $f`
        echo Converting file $b
        o="$b.picard.txt"
        echo Output file is $o
	picard -Xmx5g CollectAlignmentSummaryMetrics I=$f O=$o R=/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/copied_old_infiles/GCF_001266775.1_Austrofundulus_limnaeus-1.0_genomic_andMITO.fna
    done
#done

```

Before we proceed, we need to sort the sam, convert it into bam and create a bam index (bai).
This step ensures that we could store the alignments in acceptable binary format that takes less storage space
and is compatible with subsequent steps.


```
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.sam"
for f in $FILES
    do
        b=`basename $f`
        echo Converting $b to bam
        o=${b::-4}
        bam="$o.bam"
	sortedbam="$b.sorted.bam"
	
	echo Tmp file is $bam
        samtools view -b $f > $bam
	
        echo Output files are $sortedbam and its index
	samtools sort $bam -o $sortedbam
	samtools index $sortedbam
	echo Deleting Tmp file $bam 
	rm $bam 
    done
#done
```

## Using aligned reads

We will create three files:

1. Summary information that has read ID, sequence, reference location, and number of alignments.
2. Exact location, genomic features of the reference location that the reads mapped to via bedtools.
3. Counting the readcounts from bam files.

```#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        o="$o.summary.txt"
	input_file="$f"
	output_file="$o.summary.txt"

	# Process the input file with samtools and awk
	samtools view $f | awk '{print $1, $10, $3, $4, $15}' | sort | uniq > "$output_file"
	echo "Output file written $output_file"
    done
```
The result is the following data for each file. The sample 1_12d_t0 is shown.
![](screens/summary_1_12d_t0.png)

```#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        bed="$o.bed"
	out="$o.bedtools.txt"
	gff="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/mito_genome/GCF_001266775.1_Austrofundulus_limnaeus_MITOonly.gff"
	bedtools intersect -a $f -b $gff -wo -f 1 -bed > $bed
	awk '!seen[$4]++' $bed > tmp.csv
	rm $bed
	awk '{print $4, $21}' tmp.csv > $out
	rm tmp.csv
	echo "Output file written $out"
    done

```
The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown.
![](screens/bedtools_1_12d_t0.png)


```#!/bin/bash
FILES="/disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alim_for_anno/alim_mito_anno/*.bam"
for f in $FILES
    do
        b=`basename $f`
        echo $b
        o=${b::-4}
        o="$o.readcount.txt"
        echo Output file is $o
	# Count unique occurrences of column 10
	samtools view $f | awk '{print $10}'| sort | uniq -c > "$o"
	echo "Output written to $o"
    done

```
The result of above command gives us the following data for each file. The sample 1_12d_t0 is shown.
![](screens/readcount_1_12d_t0.png)

### miRTrace
 Running mirtrace with the following command.

```#!/bin/bash
mirtrace qc --species dre --config seq_address.txt
```

Looking at the results:

![](mirtrace/mirtrace-length-plot.png){width=100%}
![](mirtrace/mirtrace-rnatype-plot.png){width=100%}

### sports

Executing sports tool on mapped reads only. We use the following command to run:

```
#!/bin/bash
sports.pl -i seqs_alim_mito.txt \
-o /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/out_sports/REDO/mito_mapped_reads \
-k \
-M 2 \
-p 20 \
-g /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/bowtie_mito_index/almitogenome \
-m /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/miRBase_21/miRBase_21-dre \
-r /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/rRNA_db/zebrafish_rRNA \
-t /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/GtRNAdb/danRer6-tRNAs \
-w /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/piRBase/piR_dre_v1.0 \
-e /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Ensembl/Danio_rerio.GRCz10.ncrna \
-f /disk/bioscratch/Podrab_lab/gazal/sRNA_gazal/align_with_alimnaeus/gen_input_fastq/SPORTS/sports1.1/db/Danio_rerio/Rfam_12.3/Rfam-12.3-zebrafish
```

This will generate a number of output files, from which the tables
generated as a result file *_output.txt for each sample are taken and 
are used to extract annotation of mapped reads.
Top 50 rows of this table are shown for the sample 10_12d_4hrA below. 


![](sports/Screenshotfrom10_12d_4hrA.png){width=550px}

It also generates a number of plots to categorize the annotations 
and their length distributions for each sample.

![](sports/Screenshot10_12d_4hrA.png){width=550px}
![](sports/Screenshot10_12d_4hrA_graph.png){width=550px}

## Compiling output files

At the end, we need to compile all the output files into a single dataframe that can be exported
to further do downstream processing such as statistics, differential expression, and clustering.

We provide an [R script](scripts/merge_counts_mito.R) that compiles these output files:

1. Annotation: From **sports**- *.mapped_output.txt
2. Alignment summary: From .bam files with **samtools**- *.summary.txt
3. Readcount: From .bam files with **samtools** and **awk**- *.readcount.txt
4. Location: From .bam files with **bedtools**- *.bedtools.txt

In addition, we also need a sample design file (.csv format) that has details about the samples.
This file makes it easier to do statistical analysis on the data using R packages like 
DESeq2, ExpressionSet, and EdgeR.

You would alter the following lines in the R script that basically tells the program where these txt files are located.

```{r, eval=FALSE}
parent_dir <- "D:/PSU_work/smrnaseq/mito_map"
samples <- "D:/PSU_work/smrnaseq/mito_map/design_samples.csv"
readcount_txt_pattern <- "sorted.readcount.txt$"
output_txt_pattern <- "mapped_output.txt$"
summary_txt_pattern <- "summary.txt$"
bedtools_txt_pattern <- "bedtools.txt$"
```

Set pattern of filenames: these files are searched in the parent_dir recursively.
Make sure that the parent_dir contains these files and the pattern should be unique to list the files that we need.

Merge results from our samples can be viewed from our [github repo](https://github.com/Jpod-lab/srnadocs/tree/main)

## Visualizing
We use the alignments with mitochondrial genome to view the aligned reads.

Jbrowse is used to explore alignments.

Files used:

1. Aligned bam files for each sample and its corresponding index file (bam, bai)
3. Reference genome for mitochondria (.fasta) and its index file (.fai)
4. Genome feature file (.gff) for mitochondrial genome

![](jbrowse/jbrowse_4d_t0.svg)
<a href="jbrowse/jbrowse_4d_t0.svg" target="_blank"> Click to enlarge</a>

This figure shows mitochondiral genome of 100bp length from 17100 to 17199. On this genome, we see 
1) the colored basepairs, 2) Genomic features of the mentioned length, 3) Alignments from 4 samples (4d t0 replicates)
that show reads aligned to the genome colored by strand while also showing the coverage histogram in grey. 